Information Extraction Tools for Portable Document Format
Authors
Abstract
Interest in the new publishing phenomenon known as the e-book has grown enormously in the last few years. At least 150 companies are now involved in various ways in the development of e-books. Despite this involvement, the spread of e-books has not yet proved useful in the implementation of digital libraries. Using e-books in PDF format to implement a digital library requires a robust information extraction system. In this paper we survey ten tools for extracting content such as text, images, tables, and fonts from e-books in PDF format. We also compare these information extraction tools on the basis of various factors.
Similar resources
A Curation Pipeline and Web-Services for PDF Documents
The continuous growth of the biomedical literature and the need to efficiently find and extract information from its content led to the development of various text mining tools. More recently, these tools started being integrated in user-friendly applications facilitating their use by expert database curators. However, these tools were mainly designed to extract information from text based docu...
A Study of Information Extraction Tools for Online English Newspapers (PDF): Comparative Analysis
Information retrieval is the task of retrieving relevant and useful information from e-newspapers. Electronic newspapers are electronic replicas of traditional newspapers. E-newspapers are becoming increasingly popular because of the ease and convenience in accessing them. Newspapers are the source of timely information. These are the documents comprising news items and several independent info...
TAO: System for Table Detection and Extraction from PDF Documents
Digital documents present knowledge in most areas of study, exchanging and communicating information in a portable way. To better use the knowledge embedded in an ever-growing information source, effective tools for automatic information extraction are needed. Tables are crucial information elements in documents of scientific nature. Most publications use tables to represent and report concrete...
Ontology-Based Information Extraction from PDF Documents with Xonto
Information extraction is of paramount importance in several real-world applications in the areas of business, competitive, and military intelligence because it makes it possible to acquire information contained in unstructured documents and store it in structured form. Unstructured documents have different internal encodings; one of the most widespread encodings is the visualization-oriented Adobe portab...
Research and Realization about Conversion Algorithm of PDF Format into PS Format
This paper first introduces the characteristics of PostScript and PDF documents as a basis, and argues for the necessity and feasibility of converting the PDF document format to a PostScript language program. Second, it studies the main algorithms and techniques of the conversion process, and finally implements information extraction for PDF documents, achieving th...